Evolution of bacterial genomes under horizontal gene transfer 
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Introduction 



Unraveling the evolutionary forces shaping bacterial diversity can today be tackled using a growing 
amount of genomic data. In recent years, the number of completely sequenced prokaryotic genomes has 
increased to around 1700 (NCBI). In particular, first datasets are available for samples of complete 



genomes from closely related strains, which are of the same bacterial species (Medini et al. , 2005 



Tettelin et al. , 2005 2008). Such datasets mark a revolution of microbial evolutionary biology, which 



turned from a theory-rich/data-poor subject into a data-rich/theory-poor field in the last decade. 
While the genome of eukaryotes is highly stable, bacterial genomes from cells of the same species 
highly vary in gene content. For example, the pathogenic strain E. coli 0157:H7 carries 1387 genes 



which are absent in the commensal E. coli K-12 (Perna et al. , 2001). This huge variation in gene 



content led to the concepts of the distributed genome of bacteria and their pangenome ( Tettelin et al. 



2005, Ehrlich et al. , 2005). In datasets, genes present in all genomes of a taxon are called core genes 



while genes present in only some but not all individuals comprise the accessory genome. 

Gene content diversity originates in horizontal exchange of genomic material and pseudogenozation 

followed by gene loss. In particular, the amount of genetic exchange within a bacterial species deter- 



mines the level of clonality (Smith et al. , 1993). There are three different mechanisms of horizontal 



genetic exchange: (a) Transformation is the uptake of genetic material from the environment, (b) 
When a bacterium is infected by a lysogenic virus (phage) it provides additional genetic material that 
can be built in the bacterial genome. This process is known as transduction, (c) Conjugation requires 
a direct link (pilus) between two bacterial cells and leads to exchange of genetic material. These three 
mechanisms are usually referred to as horizontal gene flow. Recently, small virus-like elements called 
Gene Transfer Agents (GTAs) have been hotly debated to be the most important source for horizontal 



genetic exchange in some species (McDaniel et al. , 2010). 



We present a population genetic model for gene content evolution which accounts for several mecha- 
nisms. Gene uptake from the environment is modeled by events of gene gain along the genealogical 
tree, which describes the relationships between the individuals of the population. Pseudogenization 
may lead to deletion of genes and is incorporated by gene loss. These two mechanisms were studied 



by Huson and Steel (2004) using a fixed phylogenetic tree. Taking the random genealogy given by 



the coalescent (Kingman, 1982 Hudson 1983), we studied the resulting genomic diversity already in 



Baumdicker et al. (2010) (see also Baumdicker et al. 2011 ). In the present paper, we extend the model 



in order to incorporate events of intraspecies horizontal gene transfer. Within this model, we derive 
expectations for the gene frequency spectrum and other quantities of interest. 



The model 



We consider the following model for bacterial evolution: Each bacterial cell carries a set of genes and 
every gene belongs either to the core genome or the accessory genome. The infinite set I := [0, 1] is 
the set of conceivable accessory genes and Qc with H / = is the core genome. A population of 
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Figure 1: Genes are with probability 9/{2N) per individual per generation. In this illustration 
only one gene is gained in generation 1 in individual 4. Individuals carrying this gene are shown in 
black. Offspring inherits the gene from its ancestor, unless a loss event occurs (cross), with probability 
p/{2N). With probabihty ^ / {2N^)X{\ — X), a gene is transferred to a random individual, such that 
now both the donor and the acceptor carry the gene. 

constant size consists of N individuals (bacterial cells) . We model the accessory genome of individual 
z by a finite counting measure Gi{t) on /. We will identify finite counting measures with the set of 

atoms, i.e. we write u € Qi{t) if {Gi{t), !„) > 1. 

The population evolves according to Wright-Fisher dynamics. That is, generations are discrete and 
individual (bacterial cell) j in generation t + 1 chooses a parent from generation t purely at random 
and independent of all other individuals at time t + 1. We denote that parent by Aj{t). In order to 
obtain the genome Qj{t + 1), we follow the mechanisms: 

1. Gene loss: Denote the 1 - p/(2A^)-thinning of GAj{t){t) by Gjit + 1). That is, u G Gjit + 1) iff 
u G GAj{t){'t) and an independent coin with success probability 1 — p/{2N) shows a success. 

2. Gene gain: Choose an independent random counting measure Hj(t + 1) according to a Poisson 
process on / with intensity 0/{2N). 

3. Horizontal gene transfer: For every i = 1,...,N (the donor) and v G Gi{t) (the transferred 
gene), let v G T-L'j{t + 1) with probability 7/(2iV^). In this event, individual j is called the 
acceptor of gene v. 

Finally, set 

Gj{t + 1) = {G'jit + 1) + 'H!j{t + 1) + -H'-it + 1)) A 1 

for the genome of individual j in generation t + \. The 'Al'-term indicates that we do not model 
paralogous genes, i.e. horizontal gene transfer events have no effect if the acceptor individual j already 
carries the transferred gene. We refer to {Gi{t), . . . ,GN{t))t=o,i,2,... undergoing the above dynamics as 
the Wright-Fisher model for bacterial genomes with horizontal gene flow. It can be shown that this 
Markov chain is Harris recurrent and hence, has a unique equilibrium. 
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We are mainly interested in large populations and a rescaling of time by a factor of N. The corre- 
sponding limit is usually refered to as large population limit in the population genetic literature. The 
following argument is crucial in the proof of our main result, Theorem [l] Let X^{t) be the frequency 
of gene u in generation [tN] in the population of size A^. Then, in the large population limit, — )• oo, 
the process {X^ {t))t>o converges weakly to the solution of the SDE 

(1) dX ={-^X + - X))dt + - X)dW 

for some Brownian motion W. To see this, note that the evolution of frequencies of gene u is an 
autonomous process. The diffusion term is associated with random reproduction events and has the 



well-known form from ([T]) (Ewens, 2004), known as Wright-Fisher noise. Gene loss reduces X^ with 
probability approximately proportional to X and p/{2N). After rescaling of time, this turns into 
the rate —{pX/2)dt. Last, horizontal gene transfer increases X^ with probability approximately 
proportional to j/{2N^) and to the number of pairs where the horizontal gene transfer events has an 
effect, N'^X^{1 — X^). After rescaling of time, this turns into the rate 2"'^(1 ~ X)dt. 

Sample statistics 

Consider a sample ^i, . . . , ^„ of size n taken from the population. We consider several statistics under 
the above dynamics: 

The average number of genes ( in the accessory genome ) is given by 



1 " 

(2) A := ■=-Y^ 



n 

i=l 

where \Qi\ := {Qi, 1) is the total number of accessory genes in individual i. 
The average number of pairwise differences is given by 

(3) i):=i)("):=^_ ^ \g,\g^\ 

nin — 1) ^-^ 

where Qi \ Qj := {Qi — Qj)~^ are the genes present in i but not in j. 
The size of the accessory genome is given by 



(4) G := ■.= \\JQ^ 



i=l 

where UILi Si = SILi A 1 is the set of genes present in any individual from the sample. 

The gene frequency spectrum (of the accessory genome) is given by Gi := gS"\ . . . , Gn ■■= G'n\ where 

(5) G^^^ := Gk '■= \{u £ I : u ^ Qi for exactly k different i}\. 

Results 

Using diffusion theory, we obtain first moments of all of the above statistics in equilibrium. We start 

(n) (n) 

with expectations of G\ , . . . , Gn , since all other quantities can be expressed in terms of the gene 
frequency spectrum. The proof of Theorem [T] can be found at the end of the manuscript. 

Theorem 1 (Gene frequency spectrum). Consider a sample of size n taken from the Wright-Fisher 
model for bacterial genomes with horizontal gene flow with p > 0,9 > 0,^ > in equilibrium. Then, 
as N ^ oo, 

/f5^ ^i^nh ^ 0_ n---{n-k + l) , jkUl"" 

^ ' k{n-l + p)---{n-k + p)\ ^ ^ {n + p)ram\ 

\ I / \ 11 m=l 

with {a)iy := a{a + 1) • • • (a + 6 — 1). 
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Figure 2: The expected gene frequency spectrum is highly dependent of 7, the rate of horizontal gene 
flow. For high values of 7, most genes are in high frequency, leading to a closed pangenome. We use 
p = 2 in the figure. 



Corollary 2 (More summary statistics of gene content). 
Under the same assumptions as in Theorem^ 
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(7) E[A(")] = 1 + E n 



I (1 + P)m 

m=l ^ ' , 



00 



(8) E[I)(")] = — 



1 + P V ^1 (2 + p)„ 

\ m=l 
n — 1 1 00 „ 
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(9) iE[G(")]=e5:^+ej: 



fc=0 



Discussion 



We introduced the Wright-Fisher model for bacterial genomes with horizontal gene flow, measured 
by the parameter 7. Since the corresponding model without horizontal gene flow was considered in 
Baumdicker et al. (2010), we note that Theorem [T] and Corollary [2] imply that all sample statistics are 
continuous at 7 = 0. 



Recently, the concepts of open and closed pangenomes were introduced (Medini et al. , 2005). If, after 
sequencing a flnite number of genomes, all genes present in the population are found, one speaks of a 
closed pangenome. If new genes are found even after sequencing many cells, the pangenome is open. 
It is not hard to see that high values of 7 imply that most genes are in high-frequency. In other words, 
sequencing a new individual hardly leads to new genes which were not seen before. This impact of 
openness and closedness of the pangenome can as well be seen from Figure [2| 



Consider the diffusion Q, which describes the approximate frequency of individuals carrying one 
speciflc gene u £ I. Usually, the '^X{1 — X)-ter:m appears in population genetic models only due to a 
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because horizontal gene flow increases the frequency of the gene by a rate which is proportional to the 
number of possible donor/acceptor-pairs of individuals. 

Due to the close connection of horizontal gene transfer with selective models, a comparison to recent 
work is appropriate. In particular, the theory for the frequency spectrum in selective models with irre- 
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a Poisson Random Field model for selective sites. (Extensions 
2005 ) They assume that a large set of unlinked loci is under 
selection. As a result, they obtain predictions for the number of alleles present in a subset k out of n of 
individuals. However, since their loci are unlinked, the random processes are completely independent 
for the different processes. This is in contrast to our approach where the reproduction within the 
Wright-Fisher model affects all genes in the same way, and the horizontal gene transfer only affects 
single genes. Moreover, our model is reversible in the sense that present genes may as well be lost (but 
not reintroduced). Since it has been shown in Baumdicker et al. (2010) that discontinuities at p = 



(no gene loss) arise, it is not straight-forward to use these classical results in the present setting. 
Simulations 



We are interested in patterns of presence/absence of genes in a sample of size n in an equilibrium 
situation. If presence/absence of a gene would be independent of the state of the other genes, the 
gene frequency spectrum could be simulated by independent copies of the diffusion X, given in ([T|. 
However, all genes are inherited along the same lineages, so the frequency of two different genes depend 
on each other; see Figure |4j For 7 = the composition of the accessory genome of n individuals can 
be simulated backwards in time using the coalescent. As, for 7 = 0, genomes only depend on events 



along the ancestral lineages of the sample this is very efficient (Hudson, 2002). In the case 7 > 0, we 
simulate the accessory genome forward in time. Therefore, we have to consider all individuals of the 
population, as each of them might influence the accessory genome of the sample of size n by events 
of horizontal gene transfer. Unfortunately the size of the population has to be much larger than n to 
obtain values close to the large population limit results. Thus the forward simulations for 7 > are 
much slower than backward simulations for 7 = 0. 

Theorem [1] gives the expected sizes of the gene frequency spectrum. For 7 = it is known that the 
gene frequency spectrum highly depends on the underlying genealogy. E.g. if the genealogy separates 
the n individuals into two groups, one of size k and one of size n — k, then G^"^ and G^}^j^ increase. 
The same is true for simulated data with 7 > 0, see Figure [3j 

Proof of Theorem [l] and Corollary [2] 

We consider the diffusion ([T]) with infinitesimal mean and variance 

b{x) = —\px + 57x(l — x), a{x) = x{l — x). 

The Green function for the diffusion, measuring the time the diffusion, i.e. a gene, spends in frequency 
x until eventual loss, if the current frequency is is given by 



G{5,x) 



a{x)ip{x) ' 



where 



b{z) 



i;{y) := exp (^-2 ^^dz ) = (1 - y)^-Pe-^y 

(x) := / ip{y)dy. 
Jo 
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Figure 3: The expected gene frequency spectrum for n = 10, = 10, 7 = 6 and p = 2 is shown as a 
soHd black Une. For twelve different simulations with = 500 individuals the gene frequency spectrum 
for n = 10 randomly chosen individuals is shown (dashed lines). The mean of 1000 simulations (black 
circles) is close to the results of Theorem [T] 
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Figure 4: The frequencies for two different genes in a population of size 2000 are shown. Here p = 0.2 
and 7 = 1. At time zero initially 600 individuals carry gene 1, 800 individuals carry both, gene 1 and 
gene 2 and 600 individuals carry none of the two genes. The frequencies are not independent as both 
genes depend on the same underlying ancestral lineages. Since gene loss and horizontal gene transfer 
events occur independently for each gene, they can weaken the dependency of the two frequency paths. 
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Following Durrett (2008), we introduce new genes in frequency 5 at rate I^tttt in a consistent way. 



2 0(5) 



That is, the gene raises in frequency to e > 6 with probability ^y^. Hence the number of genes in 
frequency x is Poisson with mean 

6 1 e^^ 

-G{s,x) = e- 



2(j){6) ^ ' ^ x{l-xy~p' 
The gene frequency spectrum is now given by 



J\'y''x''-\i-xr'''-^+pdx 
,)(^--')! r(n + rt '^''^^" + ''^^> 



oo 



where iFi{k; n+p; 7) = 1+ X] (1+^^ ^ hypergeometric funtion and {a)b ■= a(a + l) • • • (a+b — l) 



m=l 



{n+p)mm 



Given the gene frequency spectrum, it is now easy to compute first moments of A, D and G (see 
Corollary [2]) by using 



n 71 „ , 00 ^ 00 n— 1 ^ 

k=l k=l 771=0 ?n=0 fc=0 



„A; + p ^ m f-;' V (/c + p)m (A; + l + /9), 

A;=0 m=l k=0 

1 °° 'v'" / 1 1 

+ P 7^1 ^ (^^^"^ + 
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